nflscrapR graphicsThis resource is modeled after the fantastic BBC Graphics Cookbook, which is also worth checking out. The nflscrapR team (Maksim Horowitz, Ron Yurko, and Sam Ventura) have compiled easy to access play-by-play stats opening a deeper world of NFL analytics for reporters, bloggers and enthusiasts (and probably some NFL teams). Ben Baldwin has compiled a quickstart guide to using this data. As such, this resource is not aimed at reproducing that tutorial, but giving you some quick guides for improving the graphics you create via ggplot2. It’s easy to get started quickly exploring the data with ggplot2 and hopefully this helps with your “publication” quality plots.
If you’d rather go deeper into a textbook and ignore specific applications related to nflscrapR, check out these amazing free online resources (some available in print as well):
| Title/Link | Author | Description |
|---|---|---|
| SocViz | Kieran Hiely | Covers exactly how to create a lot of different plot types in R/ggplot2 |
| Fundamentals of Data Viz | Claus Wilke | Covers the WHY of Data Viz where all examples are in R, but no code examples in the book, but are available on his GitHub |
| BBPlot Cookbook | BBC Data Team | Intro primer to news-style graphics in ggplot2 |
ggplot2 cookbook |
Winston Chang | Quick cookbook of ggplot2 plots |
| R Graph Gallery | Yan Holtz | Cookbook examples of a majority of plot types. |
There are a couple features that we will use throughout these examples:
dplyr::if_else()This allows you to make a binary conversion.
For example if_else(condition, true, false)
* success = if_else(epa > 0, 1, 0)
* color = if_else(posteam == "PIT", "yellow", "grey)
scale_color_identity()This is useful in combination with the above example of assigning color in a plot, essentially it will take the “yellow” or “grey” argument automatically.
scale_color_manual()This allows you to specify colors of interest like scale_color_manual(values = c("red", "black"))
forcats::reorder()This allows you to reorder levels of a ggplot by another variable.
eg reorder(posteam, epa)
teamcolors packageGives you ALL the colors for NFL teams
There are a few packages I will use in this guide, most of them related to data viz.
library(tidyverse) # Data Cleaning, manipulation, summarization, plotting
library(gt) # beautiful tables
library(DT) # beautiful interactive tables
library(ggthemes) # custom pre-built themes
library(bbplot) # more themes
library(ggtext) # custom text color
library(teamcolors) # NFL team colors and logos
library(ggforce) # better annotations
library(ggridges) # many distributions at once
library(ggrepel) # better labels
library(ggbeeswarm) # beeswarm plots
library(waffle) # waffle plots
This is taken almost verbatim from Ben’s Tutorial, but the idea is that you are adjusting the dataset to be ready for analysis. If you are interested in plays beyond pass/rush then you should probably NOT do these steps.
pbp <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2018.csv")
# clean up the data for further analysis
pbp_rp <- pbp %>%
# grab only penalties, pass, and run plays
filter(!is.na(epa), play_type == "no_play" | play_type == "pass" | play_type == "run") %>%
# create pass, rush and success columns
mutate(
pass = if_else(str_detect(desc, "(pass)|(sacked)|(scramble)"), 1, 0),
rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
success = ifelse(epa > 0, 1, 0)
) %>%
# filter to only pass or rush plays
filter(pass == 1 | rush == 1) %>%
mutate(
passer_player_name = ifelse(play_type == "no_play" & pass == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
passer_player_name
),
receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
str_extract(
desc,
"(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"
),
receiver_player_name
),
rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)| (up the middle)|(right guard)|(right tackle)|(right end)))"),
rusher_player_name
)
) %>%
mutate(
name = if_else(!is.na(passer_player_name), passer_player_name, rusher_player_name),
rusher = rusher_player_name,
receiver = receiver_player_name,
play = 1
)
This is also credited to Ben:
“Let’s look at which teams were the most pass-heavy in the first half on early downs with win probability between 20 and 80, excluding the final 2 minutes of the half when everyone is pass-happy:”
schotty <- pbp_rp %>%
filter(wp > .20 & wp < .80 & down <= 2 & qtr <= 2 & half_seconds_remaining > 120) %>%
group_by(posteam) %>%
summarize(mean_pass = mean(pass),
plays = n()) %>%
arrange(mean_pass)
schotty
## # A tibble: 32 x 3
## posteam mean_pass plays
## <chr> <dbl> <int>
## 1 SEA 0.369 320
## 2 JAX 0.435 276
## 3 TEN 0.441 263
## 4 BUF 0.452 219
## 5 BAL 0.458 299
## 6 ARI 0.466 236
## 7 NYJ 0.473 256
## 8 DET 0.482 299
## 9 WAS 0.485 239
## 10 CAR 0.491 281
## # … with 22 more rows
“The Seahawks were playing a different sport in 2018. Fun! Let’s see what that looks like:”
ggplot(schotty, aes(x = reorder(posteam,-mean_pass), y = mean_pass)) +
geom_text(aes(label = posteam))
Now this is a useful plot, but as Ben said: “This image is kind of a mess – we still need a title, axis labels, etc – but gets the point across. We’ll get to that other stuff later.”
Let’s get to that stuff now!
ggplot2 out of the box comes with a bunch of themes, things like theme_bw(), theme_minimal(), theme_classic(), and the default theme_grey().
Let’s see what they look like with the same plot as above.
theme_bw()
theme_minimal()
- Notice that we still have grey gridlines, a white background, but now no black border.
theme_classic()
But as with almost everything in R, there are more packages that add more functionality! In this case, there are entire packages dedicated to themes in ggplot2 and you have the ability to build your own themes!
library(ggthemes)
library(bbplot)
The ggthemes package gives you a wide assortment of additional themes as seen here. Most importantly it also gives you ideas about customizations to your personal theme. If you parse through the source code, you can create your own theme and utilize across your visualizations.
theme_fivethirtyeight()
theme_minimal() is ironically, minimal but the main difference is heavier grey gridlines, and a subtle grey background - which aligns with the FiveThirtyEight style.Again, the exciting part about ggthemes in my mind is the concept of creating your own theme. In fact, the code for this theme is pretty simple!
theme_fivethirtyeight <- function(base_size = 12, base_family = "sans") {
colors <- deframe(ggthemes::ggthemes_data[["fivethirtyeight"]])
(theme_foundation(base_size = base_size, base_family = base_family)
+ theme(
line = element_line(colour = "black"),
rect = element_rect(
fill = colors["Light Gray"],
linetype = 0, colour = NA
),
text = element_text(colour = colors["Dark Gray"]),
axis.title = element_blank(),
axis.text = element_text(),
axis.ticks = element_blank(),
axis.line = element_blank(),
legend.background = element_rect(),
legend.position = "bottom",
legend.direction = "horizontal",
legend.box = "vertical",
panel.grid = element_line(colour = NULL),
panel.grid.major =
element_line(colour = colors["Medium Gray"]),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0, size = rel(1.5), face = "bold"),
plot.margin = unit(c(1, 1, 1, 1), "lines"),
strip.background = element_rect()
))
}
I personally edited this so that it didn’t remove axis titles, and to have a white background instead of gray, which you can see below.
theme_538 <- function(base_size = 12, font = "Lato") {
# Text setting
txt <- element_text(size = base_size + 2, colour = "black", face = "plain")
bold_txt <- element_text(
size = base_size + 2, colour = "black",
family = "Montserrat", face = "bold"
)
large_txt <- element_text(size = base_size + 4, color = "black", face = "bold")
theme_minimal(base_size = base_size, base_family = font) +
theme(
# Legend Settings
legend.key = element_blank(),
legend.background = element_blank(),
legend.position = "bottom",
legend.direction = "horizontal",
legend.box = "vertical",
# Backgrounds
strip.background = element_blank(),
strip.text = large_txt,
plot.background = element_rect(),
plot.margin = unit(c(1, 1, 1, 1), "lines"),
# Axis & Titles
text = txt,
axis.text = txt,
axis.ticks = element_blank(),
axis.line = element_blank(),
axis.title = bold_txt,
plot.title = large_txt,
# Panel
panel.grid = element_line(colour = NULL),
panel.grid.major = element_line(colour = "#D2D2D2"),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.border = element_blank()
)
}
Regardless - the idea here is that you can:
All are valid, but you don’t necessarily have to actually manually code the theme element changes to each and every plot. You can at the least write your own theme as a function and use it. Alternatively, you can write your own package (easier than it sounds!) and source that.
If you would like to read more about customizing your OWN theme - check out the great resource by Simon Jackson at his blog.
Basic line chart = ggplot() + geom_line()
# Prepare data
wr_duel <- pbp_rp %>%
filter(receiver %in% c("A.Brown", "J.Smith-Schuster")) %>%
group_by(game_date, receiver) %>%
summarize(mean_epa = mean(epa, na.rm = TRUE))
ggplot(
wr_duel,
aes(x = game_date, y = mean_epa, color = receiver)
) +
geom_line(size = 1)
wr_duel_plot <- ggplot(
wr_duel,
aes(x = game_date, y = mean_epa, color = receiver)
) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "black") +
labs(
x = "\nGame Date",
y = "EPA (Average)",
title = "Quick comparison of AB vs Juju across the 2018 season",
caption = "Data: @nflscrapR"
)
wr_duel_plot
But we can still improve this a lot - it feels a bit crowded, plus the red/blue colro scheme doesn’t align with the team’s color or anything else. We can add colored text via the ggtext package, or we can manually change the colors. Also note that you can grab the team’s colors via teamcolors package.
pit_colors <- teamcolors %>%
filter(name == "Pittsburgh Steelers") %>%
select(name:secondary)
pit_colors
## # A tibble: 1 x 4
## name league primary secondary
## <chr> <chr> <chr> <chr>
## 1 Pittsburgh Steelers nfl #000000 #ffb612
pit_primary <- pull(pit_colors, primary)
pit_secondary <- pull(pit_colors, secondary)
wr_duel_plot <- ggplot(wr_duel,
aes(x = game_date, y = mean_epa,
color = if_else(receiver == "A.Brown", pit_primary, pit_secondary))) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "black") +
labs(x = "",
y = "EPA (Average)",
title = "Quick comparison of <span style='color:#000000'>**AB**</span> vs <span style='color:#ffb612'>**Juju**</span> across the 2018 season",
caption = "Data: @nflscrapR") +
scale_color_identity() +
theme(plot.title = element_markdown())
wr_duel_plot
Alternatively, if you didn’t want to drop a legend, you could approach it this way.
wr_duel_plot <- ggplot(
wr_duel,
aes(
x = game_date, y = mean_epa,
color = receiver
)
) +
geom_line(size = 1) +
theme_538() +
geom_hline(yintercept = 0, size = 1, color = "black") +
labs(
x = "",
y = "EPA (Average)",
title = "Quick comparison of AB vs Juju across the 2018 season",
caption = "Data: @nflscrapR"
) +
scale_color_manual(values = c(pit_primary, pit_secondary)) +
theme(
legend.title = element_blank(),
legend.position = c(0.2, 0.1)
)
wr_duel_plot
wr_duel_plot +
theme(legend.position = "none") +
geom_text(data = filter(wr_duel, game_date == "2018-09-09"),
aes(x = game_date, y = mean_epa, label = receiver),
hjust = 0, nudge_y = 0.1, size = 4
) +
geom_point(data = filter(wr_duel, game_date == "2018-09-09"),
size = 3
)
You could also try out a connected line plot which lets you plot an x and y axis, then assign “time” as a 3rd variable. I find that it helps to add a connecting line, or else you may have trouble following the linear change in time.
# Prepare data
juju_do_it <- pbp_rp %>%
filter(receiver == "J.Smith-Schuster") %>%
arrange(desc(game_date)) %>%
group_by(game_date) %>%
summarize(
total_yards = sum(yards_gained, na.rm = TRUE),
total_airyards = sum(air_yards, na.rm = TRUE)
) %>%
head(5) %>%
mutate(
game_num = row_number(),
game_text = glue::glue("Game {game_num}")
)
ggplot(juju_do_it, aes(x = total_airyards, y = total_yards, color = game_num)) +
geom_point(size = 5) +
# geom path follows the order of underlying data
geom_path(size = 2) +
# creates a line for comparison
geom_abline(intercept = 0, slope = 1, color = "grey", linetype = "dashed") +
# adds labels to only game 1 and 5
geom_text(
data = filter(juju_do_it, game_num %in% c(1, 5)),
aes(label = game_text),
hjust = 1, nudge_x = -5
) +
# set scales for 0-axis
scale_x_continuous(limits = c(0, 140)) +
scale_y_continuous(limits = c(0, 140)) +
# change color gradient to start at black and transition to yellow
scale_color_gradient(low = pit_primary, high = pit_secondary) +
theme_538() +
labs(
x = "\nTotal Air Yards",
y = "Total Yards\n",
title = "Even with his highest Air Yardage, Juju struggled in Game 4",
caption = "Data: @nflscrapR"
) +
theme(legend.position = "none")
Notice that the above plot has a diagonal trend line that runs intersecting at 0 with a slope of 1:
geom_abline(intercept = 0, slope = 1, color = "grey", linetype = "dashed")
This basic call can be applied to lots of different plots to give a reference line, where you can separate plays/players/teams into above/below the line.
Everyone’s favorite - bar charts! But always remember that bar charts can limit information - we’ll look at distribution plots of various types later, but for now back to the bar.
Basic forms:
ggplot(aes(x = category, y = value)) + geom_col()ggplot(aes(x = category, y = value)) + geom_bar(stat = "identity")Column defaults to identity, essentially the single number is read as the max value. geom_bar() on the other hand has a bit more flexibility if you want to build stacked or segmented bar charts.
rb_trio <- pbp_rp %>%
filter(
posteam == "PIT",
receiver %in% c("J.Conner", "J.Samuels", "S.Ridley") |
rusher %in% c("J.Conner", "J.Samuels", "S.Ridley"),
play_type != "no_play"
) %>%
mutate(
# Assign a single player name for filtering regardless of play_type
player = if_else(is.na(receiver), rusher, receiver),
# Add nice labels to play_type
play_type = factor(play_type, labels = c("Reception", "Rush"))
) %>%
group_by(player, play_type) %>%
summarize(
n = n(),
mean_yards = sum(yards_gained, na.rm = TRUE) / n,
mean_success = sum(success, na.rm = TRUE) / n
)
rb_trio_plot <- rb_trio %>%
ggplot(aes(x = player, y = mean_yards)) +
geom_col(aes(fill = play_type), position = "dodge")
rb_trio_plot
Something to notice above - we have created a “grouped” bar chart, where the bars are grouped by player and color is assigned to play type. We can split this out into facets as an alternative representation.
rb_trio_plot <- rb_trio %>%
ggplot(aes(x = player, y = mean_yards, fill = player, position = "dodge", group = play_type)) +
geom_col() +
facet_grid(~play_type)
rb_trio_plot
Now we are adding color by player and separating into small multiples or facets that represent the play type. Any categorical variable could be used in this fashion - you could essentially build the plot 1x and then facet by a factor to generate N versions of that graph all plotted together.
rb_trio_plot +
geom_hline(yintercept = 0.03, color = "black", size = 2) +
theme_538() +
scale_fill_manual(values = c(pit_primary, pit_secondary, "grey")) +
labs(
x = "",
y = "Avg Yards per Play",
title = "Conner and Samuels were interchangeable in 2018",
subtitle = "Ridley is no longer on the team",
caption = "Data: @nflscrapR"
) +
theme(
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 1),
panel.ontop = TRUE,
legend.position = "none"
) +
scale_y_continuous(
breaks = seq(0, 6, 1)
)
rb_trio_plot +
geom_hline(yintercept = 0, color = "black", size = 2) +
theme_538() +
scale_fill_manual(values = c(pit_primary, pit_secondary, "grey")) +
labs(
x = "",
y = "Avg Yards per Play",
title = "Conner and Samuels were interchangeable in 2018",
subtitle = "Ridley is no longer on the team",
caption = "Data: @nflscrapR"
) +
theme(
panel.grid.major.x = element_blank(),
legend.position = "none"
) +
scale_y_continuous(
breaks = seq(0, 6, 1)
)
epa_play <- pbp_rp %>%
filter(pass == 1) %>%
group_by(posteam) %>%
summarize(
n = n(),
epa_per_db = sum(epa, na.rm = TRUE) / n,
success_rate = sum(epa) / n
)
epa_play %>%
ggplot(aes(x = posteam, y = epa_per_db)) +
geom_col()
This could be a useful summary, but there’s a few issues.
So let’s try rotating the bar plot.
epa_play %>%
ggplot(aes(x = epa_per_db, y = reorder(posteam, epa_per_db), )) +
geom_col()
Yikes - that is not what we want! Instead of just swapping the x and y axes, we should have used coord_flip() - this will actually rotate the plot rather than change the structure.
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "green", "red"))) +
coord_flip() +
scale_fill_identity()
Now this is more readable, clearly arranged by the strong passing vs weak passing teams, but still could be improved. Namely, red/green is not ideal for color-blindness, and the default red/green are pretty abrasively bright! However, we can still improve the grid lines (don’t need horizontal), add some better labels, and finish out the plot.
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"))) +
coord_flip() +
scale_fill_identity() +
theme_538() +
theme(panel.grid.major.y = element_blank()) +
geom_hline(yintercept = 0) +
scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflscrapR"
)
There are some alternative reproducible methods for various bar plots on one of my other guides.
Basic form:
ggplot(aes(x = category, y = value)) + geom_col(width = 0.2) + geom_point()epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
width = 0.2
) +
geom_point(aes(color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
size = 5
) +
coord_flip() +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(panel.grid.major.y = element_blank()) +
geom_hline(yintercept = 0) +
scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflscrapR"
)
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_col(aes(fill = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"))) +
geom_text(aes(
label = posteam,
color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"),
hjust = if_else(epa_per_db > 0, -0.1, 1.1)
)) +
coord_flip() +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
axis.text.y = element_blank()
) +
geom_hline(yintercept = 0) +
scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflscrapR"
)
epa_play %>%
ggplot(aes(x = reorder(posteam, epa_per_db), y = epa_per_db)) +
geom_point(aes(color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c")),
size = 3
) +
geom_text(aes(
label = posteam,
color = if_else(epa_per_db >= 0, "#2c7bb6", "#d7181c"),
hjust = if_else(epa_per_db > 0, -0.2, 1.2)
)) +
coord_flip() +
scale_fill_identity(aesthetics = c("fill", "colour")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
axis.text.y = element_blank()
) +
geom_hline(yintercept = 0) +
scale_y_continuous(breaks = seq(-0.2, 0.3, 0.1)) +
labs(
x = "",
y = "EPA per Dropback",
title = "The majority of teams had positive EPA/dropback",
subtitle = "But there are some clear outliers",
caption = "Data: @nflscrapR"
)
In this case, the Y-axis is essentially rank - you could also revert back to just doing this as team logos or adding another variable on the y-axis. This plot is ink efficient, but also has a LOT of unused white space as a result. As such, I don’t think it is a “great” plot.
Back to stealing from Ben - who has done a great job generating interesting scatter plots. Let’s do his cleanup and then some viz. Step 1 cleans up player names and is verbatim copied from his repo.
pbp_players <- pbp_rp %>%
mutate(
passer_player_name = ifelse(play_type == "no_play" & pass == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
passer_player_name
),
receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
str_extract(
desc,
"(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"
),
receiver_player_name
),
rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)| (up the middle)|(right guard)|(right tackle)|(right end)))"),
rusher_player_name
)
)
Step 2 generates our summary dataframe with a few plays of interest. ALWAYS remember to add an ungroup() as otherwise the grouped assignment lives on in the dataset.
qbs <- pbp_players %>%
mutate(
name = ifelse(!is.na(passer_player_name), passer_player_name, rusher_player_name),
rusher = rusher_player_name,
receiver = receiver_player_name,
play = 1
) %>%
group_by(name, posteam) %>%
summarize(
n_dropbacks = sum(pass),
n_rush = sum(rush),
n_plays = sum(play),
epa_per_play = sum(epa) / n_plays,
success_per_play = sum(success) / n_plays
) %>%
filter(n_dropbacks >= 100) %>%
ungroup() # always ungroup if you no longer need the grouping effect
Basic form:
ggplot(aes(x = value, y = other_value)) + geom_point()qb_success_rate <- qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
geom_point() +
labs(x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays") +
theme_bw() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12))
qb_success_rate
This is a nice plot, minorly scaled back from Ben’s example code. There is a clear linear relationship between succcess rate (EPA > 0) and EPA per Play, which makes sense.
We could add back in a few of Ben’s code examples to improve it.
qb_success_rate +
geom_hline(yintercept = mean(qbs$epa_per_play), color = "red", linetype = "dashed") +
geom_vline(xintercept = mean(qbs$success_per_play), color = "red", linetype = "dashed")
This adds lines at the averages for each axis to help with comparison.
We could also accomplish this with the code below. In the below example, it is initially more verbose but also gives you a saved data point to work with, and could be useful if for example you wanted to do a group_by summary or a filter, basically anything beyond just a pure mean.
qb_epa_per_play <- qbs %>%
summarize(mean = mean(epa_per_play)) %>%
pull(mean)
qb_success_per_play <- qbs %>%
summarize(mean = mean(success_per_play)) %>%
pull(mean)
qb_success_rate +
geom_hline(yintercept = qb_epa_per_play, color = "red", linetype = "dashed") +
geom_vline(xintercept = qb_success_per_play, color = "red", linetype = "dashed")
We could also add a linear trendline to this plot. Either method shown below is valid, but stat_smooth allows for some additional customization.
qb_success_rate +
stat_smooth(method = "lm", geom = "line", alpha = 0.5, se = FALSE, color = "red", size = 1)
qb_success_rate +
geom_smooth(method = "lm", se = FALSE, color = "red")
Now Ben has 2x variables assigned as aesthetics in this plot, success rate as X, EPA/play as Y.
He also added a 3rd variable (size) as an aesthetic. Importantly, because we are putting size and color INSIDE aes() we get to use traditional tidyverse evaluation, so we can reference columns directly, like you see with n_plays and posteam.
qbs %>%
ggplot(
aes(x = success_per_play, y = epa_per_play)
) +
# Notice that color/size inside aes()
geom_point(
aes(
color = if_else(posteam == "SF", "red", "black"),
size = n_plays / 60
),
alpha = 0.50
) +
# we need this to assign red/black to the actual color
scale_color_identity() +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)
) +
theme(legend.position = "none")
We can then add nice labels to ALL the players via ggrepel which automatically repels labels so there is minimal to no overlap.
qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
# Notice that color/size inside aes()
geom_point(aes(color = if_else(posteam == "SF", "red", "black"), size = n_plays / 60), alpha = 0.50) +
# we need this to assign red/black to the actual color
scale_color_identity() +
# add labels for all players
geom_text_repel(aes(label = name, color = if_else(posteam == "SF", "red", "black")),
force = 1, point.padding = 0.1,
segment.size = 0.2
) +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)
) +
theme(legend.position = "none")
But that’s a LOT of names that we aren’t interested in if we want to talk about just the San Francisco QBs.
qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
# Notice that color/size inside aes()
geom_point(aes(color = if_else(posteam == "SF", "red", "black"), size = n_plays / 60), alpha = 0.50) +
# we need this to assign red/black to the actual color
scale_color_identity() +
# add labels JUST for SF
geom_text_repel(
data = filter(qbs, posteam == "SF"),
aes(label = name), color = "red",
force = 1, point.padding = 0.1,
segment.size = 0.2
) +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)
) +
theme(legend.position = "none")
Staying with our San Francisco example, we can also go about this process differently to answer how did Jimmy G. and Nick the Mullet compare? We can add nice annotations via the ggforce package for just the two players of interest.
qbs %>%
ggplot(aes(x = success_per_play, y = epa_per_play)) +
# Notice that color/size inside aes()
geom_point(aes(
color = if_else(posteam == "SF", "red", "black"),
size = n_plays / 60
),
alpha = 0.50
) +
# we need this to assign red/black to the actual color
scale_color_identity() +
# add labels JUST for Mullens/Garoppolo with ggforce
geom_mark_hull(
aes(
filter = name %in% c("J.Garoppolo", "N.Mullens"),
description = "Mullens + Garoppolo performed similarly in 2018"
),
color = "red", label.fontface = "bold", label.colour = "red", con.colour = "red"
) +
labs(
x = "Success rate",
y = "EPA per play",
caption = "Data from nflscrapR",
title = "QB success rate and EPA/play",
subtitle = "2018, min 100 pass attempts, includes all QB's rush and pass plays"
) +
theme_bw() +
theme(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 14, hjust = 0.5),
plot.caption = element_text(size = 12)
) +
theme(legend.position = "none")
Rather than summarizing data into columns/points we can also display the distribution of the data points. The most common distribution plots are:
You could also consider a stacked boxplot + jitter plot as showing the distribution.
The basic idea of a histogram is that the data is binned along some range (2 airyards in below example) across the x axis, and all values of x that fall within this count add to the total count for that specific bin.
Basic form:
ggplot(aes(x = value)) + geom_histogram()
Let’s take a look at KC and SEA, teams with very different approaches to their offenses.
sea_color <- teamcolors %>%
filter(name == "Seattle Seahawks") %>%
pull(primary)
kc_color <- teamcolors %>%
filter(name == "Kansas City Chiefs") %>%
pull(primary)
pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
group_by(posteam, play_type) %>%
summarize(n = n()) %>%
mutate(freq = n / sum(n))
## # A tibble: 4 x 4
## # Groups: posteam [2]
## posteam play_type n freq
## <chr> <chr> <int> <dbl>
## 1 KC pass 606 0.619
## 2 KC run 373 0.381
## 3 SEA pass 465 0.474
## 4 SEA run 515 0.526
So KC threw the ball almost 62% of the time, while Sea only threw the ball about 47% of the time!
But what does the distribution of throws look like between KC and SEA?
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, fill = posteam)) +
geom_histogram(binwidth = 2)
## Warning: Removed 79 rows containing non-finite values (stat_bin).
The basic histogram is “fine” but let’s spruce it up a bit! We can add our theme, the team colors, and some better labels.
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, fill = posteam)) +
geom_histogram(binwidth = 2, alpha = 0.9) +
scale_fill_manual(values = c(kc_color, sea_color)) +
geom_hline(yintercept = 0, size = 1) +
theme_538() +
theme(
legend.title = element_blank(),
legend.position = c(0.6, 0.9)
) +
scale_x_continuous(breaks = seq(-10, 60, 10)) +
labs(
x = "\nAir Yards",
y = "Count",
title = "KC threw more passes at all ranges",
caption = "Data: @nflscrapR"
)
## Warning: Removed 79 rows containing non-finite values (stat_bin).
“Computes and draws kernel density estimate, which is a smoothed version of the histogram. This is a useful alternative to the histogram for continuous data that comes from an underlying smooth distribution.” - ggplot2 docs
Basic form:
ggplot(aes(x = value)) + geom_density()
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, fill = posteam)) +
geom_density(alpha = 0.8) +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(
legend.title = element_blank(),
legend.position = c(0.6, 0.9)
) +
scale_x_continuous(breaks = seq(-10, 60, 10))
## Warning: Removed 79 rows containing non-finite values (stat_density).
An important point - I try not to focus on the Y axis for either histogram/density plots as we are looking at the distribution itself rather than specific numbers. You can scale out the y-axis in a few ways for density plots, which I’ll demonstrate below.
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, y = ..scaled.., fill = posteam)) +
geom_density(alpha = 0.8) +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(
legend.title = element_blank(),
legend.position = c(0.6, 0.9)
) +
scale_x_continuous(breaks = seq(-10, 60, 10))
## Warning: Removed 79 rows containing non-finite values (stat_density).
Interestingly though, we see that KC and SEA essentially attacked the field in the same way, BUT SEA threw so many fewer passes which was captured in the histogram.
A nice addon to density plots is through the ggridges package, which allows for the creation of stacked density and histogram plots.
Basic form:
ggplot(aes(x = value, y = category)) + geom_density_ridges()
pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = air_yards, y = posteam, fill = posteam)) +
geom_density_ridges() +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
legend.position = "none"
) +
scale_x_continuous(breaks = seq(-10, 60, 10)) +
labs(
x = "Air Yards",
y = "",
title = "SEA and KC pass to similar depths of the field",
caption = "Data: @nflscrapR"
)
## Picking joint bandwidth of 2.53
## Warning: Removed 79 rows containing non-finite values
## (stat_density_ridges).
Boxplots are another way of showing central tendency + range of a distribution, but they can still have their quirks or difficulties in explanations. I typically find that adding a geom_jitter() call on top of the boxplot helps with showing both the distribution and the central tendency/range, but YMMV.
Basic form:
ggplot(aes(x = category, y = value)) + geom_boxplot()pbp_rp %>%
filter(play_type == "pass") %>%
filter(posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = posteam, y = air_yards, fill = posteam)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.2) +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(legend.position = "none")
## Warning: Removed 79 rows containing non-finite values (stat_boxplot).
## Warning: Removed 79 rows containing missing values (geom_point).
pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = play_type, y = epa , fill = play_type)) +
geom_boxplot() +
geom_jitter(width = 0.3, alpha = 0.1) +
scale_fill_manual(values = c(kc_color, sea_color)) +
theme_538() +
theme(legend.position = "none") +
facet_grid(~posteam)
geom_sina() from the ggforce package is an alternative to the above wokflow, and is somewhat similar to a vertical geom_density()
Basic form:
ggplot(aes(x = category, y = value)) + geom_sina()pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
ggplot(aes(x = play_type, y = epa, color = posteam)) +
geom_sina(alpha = 0.5) +
scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
theme_538() +
theme(legend.position = "none") +
facet_grid(~posteam)
geom_beeswarm() from the beeswarm package is an alternative to the above wokflow, but is almost identical to geom_sina() in its basic form.
Basic form:
ggplot(aes(x = category, y = value)) + geom_beeswarm()library(ggbeeswarm)
pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
ggplot(
aes(x = play_type, y = epa, color = posteam)
) +
geom_beeswarm(alpha = 0.5) +
scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
theme_538() +
theme(legend.position = "none") +
facet_grid(~posteam)
Importantly, although this looks very similar to the geom_sina() plots, you have more options about customizing the “swarming”.
pbp_rp %>%
filter(play_type != "no_play", posteam %in% c("SEA", "KC")) %>%
ggplot(
aes(x = play_type, y = epa, color = posteam)
) +
geom_beeswarm(priority = "random", alpha = 0.5, size = 0.5) +
scale_fill_manual(values = c(kc_color, sea_color), aesthetics = c("fill", "color")) +
theme_538() +
theme(legend.position = "none") +
facet_grid(~posteam)
Dumbell plots are typically best served comparing two summary numbers within a group.
Basic form:
ggplot(aes(x = value, y = category, group = group)) + geom_line() + geom_point() + coord_flip()rush_v_pass <- pbp_rp %>%
filter(play_type != "no_play", penalty == 0) %>%
group_by(play_type, posteam) %>%
summarize(avg_yds = mean(yards_gained, na.rm = TRUE)) %>%
ungroup()
nfl_rvp <- pbp_rp %>%
filter(play_type != "no_play") %>%
group_by(play_type) %>%
summarize(avg_yds = mean(yards_gained, na.rm = TRUE)) %>%
ungroup() %>%
mutate(posteam = "NFL")
rush_v_pass <- bind_rows(rush_v_pass, nfl_rvp) %>%
mutate(play_type = factor(play_type,
levels = c("pass", "run"),
labels = c("Pass", "Rush")))
rush_v_pass %>%
ggplot(aes(x = fct_rev(fct_reorder2(posteam, desc(play_type), avg_yds)), y = avg_yds, color = play_type)) +
geom_line(aes(group = posteam), color = "grey", size = 3) +
geom_point(size = 5) +
coord_flip()
Adding some additional aesthetic changes to improve the graph:
rush_v_pass %>%
ggplot(aes(x = fct_rev(fct_reorder2(posteam, desc(play_type), avg_yds)), y = avg_yds, color = play_type)) +
geom_line(aes(group = posteam), color = "grey", size = 3) +
geom_point(size = 5) +
geom_text(
data = filter(rush_v_pass, posteam == "KC" & play_type == "Pass"),
aes(label = play_type),
hjust = 0, nudge_y = 0.2, fontface = "bold", size = 6
) +
geom_text(
data = filter(rush_v_pass, posteam == "KC" & play_type == "Rush"),
aes(label = play_type),
hjust = 1, nudge_y = -0.2, fontface = "bold", size = 6
) +
coord_flip() +
scale_color_manual(values = c("#003399", "#ff2b4f")) +
theme_538() +
theme(
panel.grid.major.y = element_blank(),
legend.position = "none",
axis.text.y = element_text(color = if_else(rush_v_pass$posteam == "NFL", "red", "black"))
) +
labs(
x = "",
y = "\n Average Yards Gained",
title = "Passing yards per play outperforms Rushing for all teams",
caption = "Data: @nflscrapR"
) +
scale_y_continuous(
limits = c(3, 9),
breaks = seq(3, 8, 1)
)
A slope chart allows you to show the change/trend between two points, most appropriately as a two point time series.
The basic form:
ggplot(aes(x = time, y = value, group = group)) + geom_line() + geom_point()This demo takes a bit of prep, you could skip this simply by using game_date instead of game_week, but I think it’s a useful code-chunk for you to have in the boolbox.
case_when() is essentially a switch or a combination of if_else() statements.We can use it to have a bunch of arguments where you can match some argument and then output something specific. Here we are checking if the game_date is between the two dates for that week’s games and then assigning game_week of the season.
game_num <- pbp_rp %>%
mutate(game_week = case_when(
between(game_date, as.Date("2018-09-05"), as.Date("2018-09-11")) ~ 1,
between(game_date, as.Date("2018-09-12"), as.Date("2018-09-18")) ~ 2,
between(game_date, as.Date("2018-09-19"), as.Date("2018-09-25")) ~ 3,
between(game_date, as.Date("2018-09-26"), as.Date("2018-10-02")) ~ 4,
between(game_date, as.Date("2018-10-03"), as.Date("2018-10-09")) ~ 5,
between(game_date, as.Date("2018-10-10"), as.Date("2018-10-16")) ~ 6,
between(game_date, as.Date("2018-10-17"), as.Date("2018-10-23")) ~ 7,
between(game_date, as.Date("2018-10-24"), as.Date("2018-10-30")) ~ 8,
between(game_date, as.Date("2018-10-31"), as.Date("2018-11-06")) ~ 9,
between(game_date, as.Date("2018-11-07"), as.Date("2018-11-13")) ~ 10,
between(game_date, as.Date("2018-11-14"), as.Date("2018-11-20")) ~ 11,
between(game_date, as.Date("2018-11-21"), as.Date("2018-11-27")) ~ 12,
between(game_date, as.Date("2018-11-28"), as.Date("2018-12-04")) ~ 13,
between(game_date, as.Date("2018-12-05"), as.Date("2018-12-11")) ~ 14,
between(game_date, as.Date("2018-12-12"), as.Date("2018-12-18")) ~ 15,
between(game_date, as.Date("2018-12-19"), as.Date("2018-12-25")) ~ 16,
between(game_date, as.Date("2018-12-30"), as.Date("2019-01-01")) ~ 17,
TRUE ~ 99
)
) %>%
filter(game_week != 99)
Now we can clean up some of the factors for better printing and limit to KC, the most offensively efficient team in 2018. We’ll be looking at 1st Half vs 2nd Half Runs & Passes.
wk_rvp <- game_num %>%
filter(play_type != "no_play", game_half %in% c("Half1", "Half2")) %>%
mutate(game_half = if_else(game_half == "Half1", "1st Half", "2nd Half")) %>%
group_by(posteam, game_half, game_week, play_type) %>%
count() %>%
ungroup()
kc_rvp <- wk_rvp %>%
filter(posteam == "KC") %>%
mutate(game_num = if_else(game_week <=11, game_week, game_week - 1),
play_type = if_else(play_type == "run", "Rush", "Pass"),
game_text = glue::glue("Game {game_num}")
)
kc_rvp
## # A tibble: 64 x 7
## posteam game_half game_week play_type n game_num game_text
## <chr> <chr> <dbl> <chr> <int> <dbl> <glue>
## 1 KC 1st Half 1 Pass 15 1 Game 1
## 2 KC 1st Half 1 Rush 11 1 Game 1
## 3 KC 1st Half 2 Pass 11 2 Game 2
## 4 KC 1st Half 2 Rush 10 2 Game 2
## 5 KC 1st Half 3 Pass 27 3 Game 3
## 6 KC 1st Half 3 Rush 14 3 Game 3
## 7 KC 1st Half 4 Pass 16 4 Game 4
## 8 KC 1st Half 4 Rush 14 4 Game 4
## 9 KC 1st Half 5 Pass 25 5 Game 5
## 10 KC 1st Half 5 Rush 11 5 Game 5
## # … with 54 more rows
Notice that we have nice text for game_half and play_type and a game_num variable. Let’s build the basic slope chart. You need SOME type of grouping variable as your 3rd variable eg (var1 = x, var2 = y, var3 = group).
kc_rvp %>%
ggplot(aes(x = game_half, y = n, group = game_num)) +
geom_point() +
geom_line() +
facet_grid(~play_type)
This is interesting and shows the key feature of how does the trend of 1st Half vs 2nd Half Rush vs Pass look like.
However, we aren’t sure which games match to fewer first-half rushes vs more second-half rushes We need to assign a color variable in our aes() call. But first, let’s figure out the games where KC rushed more in the 1st half (aka ESTABLISH THE RUN) vs the 2nd half (REAP THE REWARDS OF ESTABLISHMENT).
kc_runs <- kc_rvp %>%
filter(play_type == "Rush") %>%
spread(game_half, n) %>%
mutate(balance = if_else(`1st Half` >= `2nd Half`, "Ran More in 1st", "Ran More in 2nd")) %>%
gather(key = "game_half", value = "n", `1st Half`:`2nd Half`) %>%
select(posteam, game_num,game_half, balance)
kc_runs
## # A tibble: 32 x 4
## posteam game_num game_half balance
## <chr> <dbl> <chr> <chr>
## 1 KC 1 1st Half Ran More in 2nd
## 2 KC 2 1st Half Ran More in 2nd
## 3 KC 3 1st Half Ran More in 1st
## 4 KC 4 1st Half Ran More in 1st
## 5 KC 5 1st Half Ran More in 2nd
## 6 KC 6 1st Half Ran More in 2nd
## 7 KC 7 1st Half Ran More in 2nd
## 8 KC 8 1st Half Ran More in 2nd
## 9 KC 9 1st Half Ran More in 2nd
## 10 KC 10 1st Half Ran More in 2nd
## # … with 22 more rows
kc_runs %>%
filter(balance == "Ran More in 1st") %>%
distinct(game_num)
## # A tibble: 4 x 1
## game_num
## <dbl>
## 1 3
## 2 4
## 3 11
## 4 15
So games 3, 4, 11 and 15 were the only games were they ran more or equal amounts in the 1st Half than the 2nd Half. Fun fact - KC went 12-4 last season, and went 2-2 in the games where they ran equally or more in the 1st Half than the 2nd Half.
kc_rvp %>%
ggplot(
aes(
x = game_half, y = n, group = game_week,
color = if_else(game_num %in% c(3, 4, 11, 15), "red", "blue")
)
) +
geom_point() +
geom_line() +
geom_text_repel(
data = filter(
kc_rvp, game_num %in% c(3, 4, 11, 15),
game_half == "2nd Half"
),
aes(label = game_num)
) +
facet_grid(~play_type) +
scale_color_identity()
Ok now we can see that in 3/4 of the ONLY games where they “established” the run in the 1st half they ended up passing dramatically more in the 2nd half (4, 11, 15). Let’s add some more context and details.
kc_rvp %>%
ggplot(
aes(
x = game_half, y = n, group = game_week,
color = if_else(game_num %in% c(3, 4, 11, 15), "#ff2b4f", "#003399")
)
) +
geom_point() +
geom_vline(xintercept = c(1, 2), size = 2, color = "black", alpha = 0.5) +
geom_line(size = 2) +
geom_point(size = 5) +
geom_text_repel(
data = filter(
kc_rvp, game_num %in% c(3, 4, 11, 15),
game_half == "2nd Half"
),
aes(label = game_text),
direction = "y", nudge_x = 0.1, segment.size = 0.1, hjust = 0,
size = 5, fontface = "bold"
) +
facet_grid(~play_type) +
scale_color_identity() +
theme_538() +
theme(panel.grid.major.x = element_blank()) +
labs(x = "", y = "N of Plays\n",
title = "In 3 of 4 games where KC established the run they ended up throwing more in the 2nd half",
subtitle = "They went 2-2 in these games, and 10-2 in their other games",
caption = "Data: @nflscrapR")
The big players here are manually changing colors and adding filtered data to add text labels for only the points of interest.
That’s all for this section - on to Tables!
You can create beautiful static and interactive tables in R through the gt and DT packages respectively!
gtThe gt package is essentially a grammar of tables, allowing you to quickly build out tables and output to RTF, HTML, or LaTeX.
Let’s do a quick analysis!
Let’s go back to our schotty example!
schotty
## # A tibble: 32 x 3
## posteam mean_pass plays
## <chr> <dbl> <int>
## 1 SEA 0.369 320
## 2 JAX 0.435 276
## 3 TEN 0.441 263
## 4 BUF 0.452 219
## 5 BAL 0.458 299
## 6 ARI 0.466 236
## 7 NYJ 0.473 256
## 8 DET 0.482 299
## 9 WAS 0.485 239
## 10 CAR 0.491 281
## # … with 22 more rows
We can quickly convert this to a table!
schotty %>%
slice(1:5, 28:32) %>%
gt()
| posteam | mean_pass | plays |
|---|---|---|
| SEA | 0.3687500 | 320 |
| JAX | 0.4347826 | 276 |
| TEN | 0.4410646 | 263 |
| BUF | 0.4520548 | 219 |
| BAL | 0.4581940 | 299 |
| TB | 0.5847176 | 301 |
| PHI | 0.5855263 | 304 |
| GB | 0.5939850 | 266 |
| KC | 0.6342412 | 257 |
| PIT | 0.6634304 | 309 |
And then we can make some changes!
schotty_gt <- schotty %>%
slice(1:5, 28:32) %>%
arrange(desc(mean_pass)) %>%
mutate(play_focus = if_else(mean_pass >= .50, "Pass Heavy", "Run Heavy")) %>%
group_by(play_focus) %>%
gt()
schotty_gt
| posteam | mean_pass | plays |
|---|---|---|
| Pass Heavy | ||
| PIT | 0.6634304 | 309 |
| KC | 0.6342412 | 257 |
| GB | 0.5939850 | 266 |
| PHI | 0.5855263 | 304 |
| TB | 0.5847176 | 301 |
| Run Heavy | ||
| BAL | 0.4581940 | 299 |
| BUF | 0.4520548 | 219 |
| TEN | 0.4410646 | 263 |
| JAX | 0.4347826 | 276 |
| SEA | 0.3687500 | 320 |
schotty_gt %>%
fmt_percent(columns = vars(mean_pass), decimals = 1) %>%
tab_header(
title = "Percentage of Passes by teams on 1st/2nd Down in 1st Half",
subtitle = "Win Prob between 20 & 80, excludes final 2 minutes of the half"
) %>%
cols_label(
posteam = "Player",
mean_pass = "Pass %",
plays = "Plays"
) %>%
cols_align(
align = "center"
) %>%
tab_source_note(
source_note = "Table: @thomas_mock | Data: @nflscrapR"
)
| Percentage of Passes by teams on 1st/2nd Down in 1st Half | ||
|---|---|---|
| Win Prob between 20 & 80, excludes final 2 minutes of the half | ||
| Player | Pass % | Plays |
| Pass Heavy | ||
| PIT | 66.3% | 309 |
| KC | 63.4% | 257 |
| GB | 59.4% | 266 |
| PHI | 58.6% | 304 |
| TB | 58.5% | 301 |
| Run Heavy | ||
| BAL | 45.8% | 299 |
| BUF | 45.2% | 219 |
| TEN | 44.1% | 263 |
| JAX | 43.5% | 276 |
| SEA | 36.9% | 320 |
| Table: @thomas_mock | Data: @nflscrapR | ||
For this example, we’ll grab just some specific players:
* Primarily Slot Receivers
* Stud RBs
* Stud TEs
And compare their performance when catching the ball on 3rd down, with a few specific criteria.
# 2018 and pass plays
pass_2018 <- pbp_rp %>%
filter(play_type == "pass", penalty == 0, sack == 0, qb_scramble == 0)
third_down_passes <- pass_2018 %>%
filter(down == 3, ydstogo <= 10) %>%
group_by(receiver_player_name) %>%
mutate(converted = if_else(yards_gained > ydstogo, 1, 0)) %>%
select(receiver_player_name, yards_gained, ydstogo, epa, converted) %>%
summarise(
mean_epa = mean(epa, na.rm = TRUE),
mean_yardage = mean(yards_gained, na.rm = TRUE),
mean_ydstogo = mean(ydstogo, na.rm = TRUE),
n = n(),
conv_rate = sum(converted) / n
) %>%
ungroup() %>%
arrange(desc(conv_rate))
rbs <- c(
"A.Kamara", "J.White", "J.Conner", "C.McCaffrey", "S.Barkley", "E.Elliott",
"J.Mixon", "T.Gurley", "D.Johnson", "M.Gordon"
)
wrs <- c(
"D.Westbrook", "A.Humphries", "C.Kupp", "G.Tate", "D.Pettis", "J.Edelman",
"C.Kupp", "W.Snead IV", "M.Sanu", "T.Lockett", "T.Gabriel", "S.Shepard", "C.Beasley"
)
tes <- c("T.Kelce", "Z.Ertz", "G. Kittle", "E.Engram", "J.Cook", "E.Ebron")
top_players <- c(rbs, wrs, tes)
Now that we have the dataframe setup, we can create a quick table.
third_conv_table <- third_down_passes %>%
filter(n >= 10) %>%
mutate(position = case_when(
receiver_player_name %in% rbs ~ "RB",
receiver_player_name %in% wrs ~ "WR",
receiver_player_name %in% tes ~ "TE",
TRUE ~ NA_character_
),
position = factor(position, levels = c("RB", "WR", "TE"))
) %>%
filter(receiver_player_name %in% top_players) %>%
select(receiver_player_name, conv_rate, n, everything(), -mean_epa) %>%
group_by(position) %>%
arrange(desc(conv_rate)) %>%
ungroup() %>%
gt::gt(groupname_col = "position")
third_conv_table
| receiver_player_name | conv_rate | n | mean_yardage | mean_ydstogo |
|---|---|---|---|---|
| WR | ||||
| C.Kupp | 0.6666667 | 12 | 14.500000 | 5.750000 |
| W.Snead IV | 0.6470588 | 17 | 8.294118 | 5.882353 |
| M.Sanu | 0.6363636 | 22 | 7.818182 | 5.590909 |
| T.Lockett | 0.6315789 | 19 | 15.052632 | 5.736842 |
| T.Gabriel | 0.6250000 | 16 | 8.812500 | 6.250000 |
| D.Westbrook | 0.6206897 | 29 | 7.482759 | 5.620690 |
| A.Humphries | 0.5500000 | 20 | 5.750000 | 5.550000 |
| C.Beasley | 0.5384615 | 26 | 8.153846 | 6.038462 |
| J.Edelman | 0.4000000 | 20 | 5.550000 | 4.850000 |
| D.Pettis | 0.3846154 | 13 | 11.384615 | 6.769231 |
| G.Tate | 0.3750000 | 32 | 6.312500 | 5.531250 |
| S.Shepard | 0.3571429 | 28 | 5.892857 | 5.607143 |
| TE | ||||
| T.Kelce | 0.6296296 | 27 | 10.111111 | 5.703704 |
| E.Ebron | 0.5357143 | 28 | 8.785714 | 5.821429 |
| J.Cook | 0.5185185 | 27 | 8.555556 | 5.851852 |
| E.Engram | 0.4705882 | 17 | 11.058824 | 5.352941 |
| Z.Ertz | 0.4000000 | 35 | 5.571429 | 5.742857 |
| RB | ||||
| C.McCaffrey | 0.5263158 | 19 | 8.684211 | 5.052632 |
| J.White | 0.4285714 | 28 | 5.250000 | 5.821429 |
| S.Barkley | 0.4000000 | 20 | 6.500000 | 4.600000 |
| A.Kamara | 0.3913043 | 23 | 6.043478 | 6.173913 |
| D.Johnson | 0.3636364 | 33 | 4.757576 | 5.363636 |
| E.Elliott | 0.3333333 | 15 | 3.533333 | 5.333333 |
And then really amp it up with further customizations!
third_conv_table %>%
tab_header(
title = "3rd Down Conversion Rates (Slot WR vs RB vs TE)",
subtitle = "Yds to go <= 10, N of Plays >= 10"
) %>%
fmt_percent(.,
columns = vars(conv_rate),
decimals = 1
) %>%
fmt_number(
columns = vars(mean_yardage, mean_ydstogo),
decimals = 1
) %>%
cols_label(
receiver_player_name = "Player",
mean_yardage = "Yds Gained",
mean_ydstogo = "Yds to Go",
n = "Plays",
conv_rate = "Conversion Rate"
) %>%
cols_align(
align = "center"
) %>%
tab_source_note(
source_note = "Table: @thomas_mock | Data: @nflscrapR"
) %>%
tab_footnote(
footnote = "Average Yards",
locations = cells_column_labels(
columns = vars(mean_yardage, mean_ydstogo)
)
)
| 3rd Down Conversion Rates (Slot WR vs RB vs TE) | ||||
|---|---|---|---|---|
| Yds to go <= 10, N of Plays >= 10 | ||||
| Player | Conversion Rate | Plays | Yds Gained1 | Yds to Go1 |
| WR | ||||
| C.Kupp | 66.7% | 12 | 14.5 | 5.8 |
| W.Snead IV | 64.7% | 17 | 8.3 | 5.9 |
| M.Sanu | 63.6% | 22 | 7.8 | 5.6 |
| T.Lockett | 63.2% | 19 | 15.1 | 5.7 |
| T.Gabriel | 62.5% | 16 | 8.8 | 6.2 |
| D.Westbrook | 62.1% | 29 | 7.5 | 5.6 |
| A.Humphries | 55.0% | 20 | 5.8 | 5.5 |
| C.Beasley | 53.8% | 26 | 8.2 | 6.0 |
| J.Edelman | 40.0% | 20 | 5.5 | 4.8 |
| D.Pettis | 38.5% | 13 | 11.4 | 6.8 |
| G.Tate | 37.5% | 32 | 6.3 | 5.5 |
| S.Shepard | 35.7% | 28 | 5.9 | 5.6 |
| TE | ||||
| T.Kelce | 63.0% | 27 | 10.1 | 5.7 |
| E.Ebron | 53.6% | 28 | 8.8 | 5.8 |
| J.Cook | 51.9% | 27 | 8.6 | 5.9 |
| E.Engram | 47.1% | 17 | 11.1 | 5.4 |
| Z.Ertz | 40.0% | 35 | 5.6 | 5.7 |
| RB | ||||
| C.McCaffrey | 52.6% | 19 | 8.7 | 5.1 |
| J.White | 42.9% | 28 | 5.2 | 5.8 |
| S.Barkley | 40.0% | 20 | 6.5 | 4.6 |
| A.Kamara | 39.1% | 23 | 6.0 | 6.2 |
| D.Johnson | 36.4% | 33 | 4.8 | 5.4 |
| E.Elliott | 33.3% | 15 | 3.5 | 5.3 |
| Table: @thomas_mock | Data: @nflscrapR | ||||
|
1
Average Yards
|
||||
Thanks again for looking through this and hopefully this is helpful, if you have any suggestions - feel free to reference the GitHub repo and share additional examples!
This work, “Tom’s Cookbook for Better Viz”, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.